1.   Introduction

In most statistical analyses it is taken for granted that the
family of probability distribution functions, say $F(y|\theta)$, may be
correctly specified on a priori grounds. Uncertainty exists, therefore,
only with reference to the values of the parameters $\theta$ involved in the
specified family of probability distribution functions (p.d.f.). In
practice, however, we are seldom in such an ideal situation; that is, we
are more or less uncertain about the family to which the true p.d.f.
might belong. It may be very likely that the true distribution is in
fact too complicated to be represented by a simple mathematical function
such as is given in ordinary textbooks.

In practice we approximate the true distribution
by one of the alternative p.d.f.'s listed in the textbooks. Needless to say,
we try to choose the most adequate p.d.f. with due thought to a priori
considerations. The p.d.f. specified by a convenient mathematical
function is usually termed the model. For further analysis the model is
identified at least tentatively with the true distribution. To put it
differently, in the process of conventional statistical analysis a
sharp distinction is seldom drawn between the model and the true
distribution.

To avoid the arbitrariness that inevitably occurs in the
process of model building, nonparametric statistical methods have been
extensively developed in the past decade. It seems to me, however, that these
methods have not been used very successfully in practical data analysis.
In fact, most statistical inferences are based on some specific
parametric model, often on the model of normal distribution.

In recent years, however, more and more emphasis has been laid
on the problem of model identification; that is, how to identify the
model when it cannot be completely specified on a priori grounds.
The main purpose of the present paper is to propose and analyze a
statistical criterion for model identification in regression analysis.
Our basic attitude toward the problem is to recognize the fact that
a certain amount of discrepancy inevitably exists between the true
distribution and the model. The best we can do in trying to cope with this
sort of situation is to identify the most adequate model among a given
set of alternatives. The adequacy of a model needs to be quantified
by introducing a suitable measure of the distance of the model from
the unknown true distribution.

It is expected intuitively that the more complicated model will
provide the better approximation to reality. On the other hand,
the less complicated model should be preferred if we wish to pursue accuracy
of estimation. To illustrate this point, let us consider the situation
where two alternative density functions, $f_1(\cdot|\theta)$ and $f_2(\cdot|\zeta)$, are given
as possible models of the density $g(\cdot)$ of the true distribution, where
$\theta$ and $\zeta$ are vectors of unknown parameters. Even if $f_1(\cdot|\theta)$ is the
better approximation to the true density $g(\cdot)$ in the sense that

$\inf_{\theta} \| f_1(\cdot|\theta) - g(\cdot) \| < \inf_{\zeta} \| f_2(\cdot|\zeta) - g(\cdot) \|$,

where $\|\cdot\|$ is a suitably defined distance, it is quite likely that

$E_{\hat\theta} \| f_1(\cdot|\hat\theta) - g(\cdot) \| > E_{\hat\zeta} \| f_2(\cdot|\hat\zeta) - g(\cdot) \|$

if $\dim\theta > \dim\zeta$, where $\hat\theta$ and $\hat\zeta$ are estimates for $\theta$ and $\zeta$,
respectively.
The above consideration leads us naturally to the so-called
principle of parsimony. That is, more parsimonious use of parameters
should be pursued so as to raise the accuracy of the estimates of the
parameters. In general, closeness to the true distribution is incompatible
with parsimony of parameters. These two criteria form a trade-off. That
is, if one pursues one of the criteria, the other must necessarily be
sacrificed. The multiple correlation coefficient adjusted for the degrees
of freedom may be the most commonly used statistic that incorporates the
two incompatible criteria into a single statistic.

Akaike [1] has proposed a more general as well as more widely
applicable statistic that ingeniously incorporates the two criteria.
Since it is based on the Kullback-Leibler Information Criterion,
Akaike's statistic is called the Akaike Information Criterion and is
abbreviated as AIC. Indeed, the procedure developed here is also based
on the Kullback-Leibler Information Criterion, but the criterion
for the choice of a regression model implied by our procedure is
considerably different from that implied by AIC. The disagreement
stems from a difference between Akaike's and our views on the true
distribution.

In Section 2 we briefly review the Kullback-Leibler
Information Criterion and the Akaike Information Criterion. In Section 3
we develop a criterion for the choice of a regression model and compare
it with a criterion implied by the Akaike Criterion. In Section 4
the Bayesian approach to the problem is considered and a different
criterion is derived from a Bayesian point of view. The bias
of the three criteria is discussed in Section 5.


2.   Information  Criterion 

Suppose that we are concerned with the probabilistic structure
of a vector random variable $Y' = (Y_1, Y_2, \ldots, Y_n)$. Let $G(y)$ be the
true joint distribution of $Y$. On the basis of a priori knowledge we
postulate a model $F(y|\theta)$ to approximate the unknown true distribution
$G(y)$, where $\theta$ is a vector of unknown parameters.

The adequacy of a postulated model may be measured by the
Kullback-Leibler Information Criterion (KLIC),

(2.1)  $I(G : F(\cdot|\theta)) = E_G\left[\log \frac{g(Y)}{f(Y|\theta)}\right] = \int \log \frac{g(y)}{f(y|\theta)}\, dG(y)$,

where $g$ and $f$ are density (or probability) functions of, respectively,
$G$ and $F$; $E_G(\cdot)$ stands for expectation with respect to the true
distribution $G$; the integration is over the entire range of $Y$. It can
be easily shown that the KLIC is nonnegative,

(2.2)  $I(G : F(\cdot|\theta)) \ge 0$,

with equality only when $F(y|\theta) = G(y)$ almost everywhere in the possible
range of $Y$; namely, only when the model is correct. (See, for instance,
Rao [5], pp. 58-59.) Incidentally, the negative value of the KLIC is
termed the entropy of a probability distribution $G(y)$ with respect to $F(y|\theta)$.
Noting the inequality (2.2) as well as an obvious equality

(2.3)  $I(G : F(\cdot|\theta)) = \int \log g(y)\, dG(y) - \int \log f(y|\theta)\, dG(y)$,

we are led to propose the following rule for a comparison of alternative
models or estimates.

Rule 2.1: (i) A model $F_1(\cdot|\theta)$ is regarded as a better approximation to
the true distribution $G(\cdot)$, i.e., a better model than an alternative model
$F_2(\cdot|\zeta)$, if and only if

(2.4)  $\inf_{\theta} I(G : F_1(\cdot|\theta)) < \inf_{\zeta} I(G : F_2(\cdot|\zeta))$

or equivalently

(2.5)  $\sup_{\theta} E_G[\log f_1(Y|\theta)] > \sup_{\zeta} E_G[\log f_2(Y|\zeta)]$.

(ii) Given a model $F(\cdot|\theta)$, an estimate $\hat\theta_1$ is regarded as a better estimate
than $\hat\theta_2$ if and only if

(2.6)  $E_{\hat\theta_1}\{E_G[\log f(Y|\hat\theta_1) \mid \hat\theta_1]\} > E_{\hat\theta_2}\{E_G[\log f(Y|\hat\theta_2) \mid \hat\theta_2]\}$

where $E_{\hat\theta_1}$ and $E_{\hat\theta_2}$ stand for expectations with respect to the sampling
distributions of $\hat\theta_1$ and $\hat\theta_2$, respectively. (Note that when we first take
an expectation with respect to $G$ the estimate $\hat\theta_1$ or $\hat\theta_2$ should be treated
as if it were a constant.)

It was pointed out by Akaike [1] that if the $Y_i$'s are independent
and identically distributed the maximum likelihood estimate may be
regarded as an estimate that minimizes the estimated KLIC, or equivalently
maximizes the estimated entropy, because the log likelihood function
divided by the sample size $n$,

(2.7)  $\frac{1}{n} \sum_{j=1}^{n} \log f(y_j|\theta)$,

may be regarded as a reasonable estimate for $E_G[\log f(Y|\theta)]$ whatever $G(y)$
is.
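
As a rough numerical illustration of (2.7), the following sketch draws an i.i.d. sample from a distribution that lies deliberately outside the postulated family and compares the average log likelihood with $E_G[\log f(Y|\theta)]$. The uniform true distribution, the unit-variance normal model, the sample size, the seed and the function names are all assumptions made for this example; none of them comes from the paper.

```python
import math
import random

def avg_log_lik(sample, theta):
    """Average log density of N(theta, 1) over the sample, as in (2.7)."""
    c = -0.5 * math.log(2.0 * math.pi)
    return sum(c - 0.5 * (y - theta) ** 2 for y in sample) / len(sample)

def exact_expected_log_lik(theta):
    """E_G[log f(Y|theta)] when G is Uniform(-1, 1): mean 0, variance 1/3."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (1.0 / 3.0 + theta ** 2)

rng = random.Random(0)  # fixed seed, illustrative only
sample = [rng.uniform(-1.0, 1.0) for _ in range(100_000)]

# The model is wrong for every theta, yet the sample average tracks the
# expectation under G, and both are largest near theta = 0.
for theta in (-0.5, 0.0, 0.5):
    print(theta, avg_log_lik(sample, theta), exact_expected_log_lik(theta))
```

In this toy setting the value of $\theta$ that maximizes the expectation is the mean of $G$, even though no normal distribution coincides with $G$.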

Apparently, the above rule for a comparison of models is not
directly applicable in practice, because the criteria are totally
dependent on the unknown true probability distribution. To establish a
practically usable criterion for model identification on the basis of the
KLIC, we need to replace unknowns in (2.5) by their reasonable estimates.
In fact, the Akaike Information Criterion (AIC) has been derived as an
approximately unbiased estimate for the KLIC, neglecting its irrelevant
constant terms and based implicitly on a fairly strong assumption.

For  the  sake  of  convenience  in  developing  our  argument  we  give 
the  following  definition: 

Definition: Given a model $F(\cdot|\theta)$, a parameter value $\theta_0$ such that

(2.8)  $I(G : F(\cdot|\theta_0)) \le I(G : F(\cdot|\theta))$

for any possible $\theta$ in the admissible parameter space is called a
pseudo-true parameter value.

If the true distribution $G(y)$ and a model $F(y|\theta)$ satisfy due
regularity conditions, the pseudo-true parameter $\theta_0$ must satisfy

(2.9)  $E_G\left[\frac{\partial}{\partial\theta} \log f(Y|\theta)\right]_{\theta = \theta_0} = 0$.

The model $F(y|\theta_0)$ may be regarded as the most adequate within
the family of models $F(y|\theta)$ in the sense that the KLIC for $F(y|\theta)$ is
minimized by $F(y|\theta_0)$.

Assuming that $G(y) = F(y|\theta_0)$ almost everywhere, Akaike [1] derives
his criterion

(2.10)  $\mathrm{AIC}(F(\cdot|\theta)) = -2 \log f(y|\hat\theta) + 2k$

as an almost unbiased estimate for $-2 E_G[\log f(Y|\hat\theta)]$, where $\hat\theta$ is the
maximum likelihood estimate for $\theta$ based on observations $y$ and $k$ is the
number of the unknown parameters, i.e., the dimension of $\theta$. The procedure
of choosing a model that minimizes the AIC is called the Minimum AIC
(MAIC) procedure. The first term of the AIC measures the goodness of
fit of the model to a given set of data, because $f(y|\hat\theta)$ is the
maximized likelihood function. The second term is interpreted as representing
a penalty that should be paid for increasing the number of parameters.
The increase in the number of parameters almost necessarily improves the
fit but only at the cost of sacrificing accuracy of estimation. In
this sense the AIC may be regarded as an explicit formulation of the
so-called principle of parsimony in model building.
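
The trade-off expressed by (2.10) can be sketched in a few lines. The block below assumes a normal regression fitted by maximum likelihood, for which the maximized log likelihood depends only on the residual sum of squares; the function names and the residual sums of squares are hypothetical numbers chosen for illustration, not results from the paper.

```python
import math

def gaussian_max_log_lik(rss, n):
    """Maximized log likelihood of a normal regression whose ML variance is rss / n."""
    sigma2_hat = rss / n
    return -0.5 * n * (math.log(2.0 * math.pi) + math.log(sigma2_hat) + 1.0)

def aic(max_log_lik, k):
    """AIC as in (2.10): -2 * (maximized log likelihood) + 2 * (number of parameters)."""
    return -2.0 * max_log_lik + 2.0 * k

# Hypothetical fits to the same n = 50 observations.
n = 50
small = aic(gaussian_max_log_lik(rss=60.0, n=n), k=3)  # 2 coefficients + variance
large = aic(gaussian_max_log_lik(rss=58.5, n=n), k=4)  # 3 coefficients + variance
print(small, large)
```

With these made-up numbers the larger model fits slightly better, but the gain in the first term is smaller than the extra penalty of 2, so the MAIC procedure would keep the smaller model.
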
Indeed, the assumption that

(2.11)  $F(y|\theta_0) = G(y)$

simplifies the derivation substantially, but there is no denying that
this simplifying assumption lessens the plausibility of the AIC to some
extent. In the next section, confining ourselves to a linear regression,
we derive another criterion without assuming (2.11) and compare it with
the AIC to see what difference might arise depending on whether or not
we assume (2.11).

3.   Identification of a Regression Model

We are interested in investigating a joint distribution of
a vector random variable $Y' = (Y_1, Y_2, \ldots, Y_n)$. Each of the $Y_i$'s may be
an observation on a certain characteristic of a randomly chosen
individual; or the $Y_i$'s may constitute a sequence of observed time series.
The distribution function $G(y)$ is unknown, but each $Y_i$ is assumed to
possess finite variance. We denote the mean vector and the
variance-covariance matrix, respectively, by $\mu$ and $\Omega$, where $\mu$ is a
vector of $n$ components and $\Omega$ is an $n \times n$ positive definite matrix.
Unless we place more a priori restrictions on the elements of $\mu$ and $\Omega$,
we can make no inference at all about the joint distribution of $Y$.

What we usually do is to assume that $\mu$ belongs to a linear
subspace of lower dimension than $n$ and that the $Y_i$'s are mutually uncorrelated.
Then we have a familiar linear regression model

(3.1)  $E(Y) = X\beta, \quad V(Y) = \sigma^2 I_n$

where $X$ is an $n \times k$ matrix of known constants, the $k$ columns of which
constitute a basis of the subspace to which $\mu$ is assumed to belong; $\beta$
is a vector of $k$ unknown parameters; $\sigma^2$ is an unknown positive
constant; $I_n$ is an identity matrix of order $n$. In most practical
situations the columns of $X$ are vectors of observations on certain
characteristics considered to be associated with $Y$. Then the model
implies that the $i$-th mean $\mu_i$ is represented as a linear function of
$k$ explanatory variables, i.e., $\mu_i = \sum_{j=1}^{k} \beta_j x_{ij}$, where $x_{ij}$ is the $(i, j)$-th
element of $X$. By assuming a regression model we can reduce the number
of unknown parameters from $n + n(n + 1)/2$ to $k + 1$.

In addition to (3.1) we often assume the normal distribution
for $Y$, and postulate a model

(3.2)  $Y \sim N(X\beta, \sigma^2 I_n)$,

or

$Y = X\beta + u, \quad u \sim N(0, \sigma^2 I_n)$,

which is termed a linear normal regression model.

Lemma 3.1: The pseudo-true values for the parameters $\theta' = (\beta', \sigma^2)$ are

(3.3)  $\beta_0 = (X'X)^{-1} X'\mu$

(3.4)  $\sigma_0^2 = \frac{1}{n}\, \mu'(I - X(X'X)^{-1}X')\mu + \frac{1}{n} \operatorname{tr} \Omega$.

The above results are easily obtained by solving the equations

(3.5)  $E\left[\frac{\partial}{\partial\beta} \log f(Y|\theta)\right] = 0$

(3.6)  $E\left[\frac{\partial}{\partial\sigma^2} \log f(Y|\theta)\right] = 0$

where $f(y|\theta)$ is the density function of $N(X\beta, \sigma^2 I)$ and the expectation is
with respect to the true distribution. Geometrically speaking, $X\beta_0$ is
a projection of the unknown mean vector $\mu$ into the space spanned by the $k$
columns of $X$, while $n\sigma_0^2$ is the sum of the variances of the $Y_i$'s plus the
squared length of the perpendicular from $\mu$ to the space. The error of
approximating $\mu$ by $X\beta$ is absorbed into the error variance.
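
The geometric reading of Lemma 3.1 can be checked numerically. The sketch below uses only the standard library: it solves the normal equations for $\beta_0$ and then forms $\sigma_0^2$ as the squared perpendicular plus the trace of $\Omega$, divided by $n$. The helper names, the quadratic mean vector, and the assumption that $\Omega$ is diagonal with unit variances are illustrative choices, not part of the paper.

```python
def ols_coefficients(X, v):
    """Solve the normal equations (X'X) b = X'v by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'v].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * vi for row, vi in zip(X, v))] for i in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c:
                m = A[r][c] / A[c][c]
                A[r] = [a - m * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def pseudo_true(X, mu, variances):
    """Return (beta_0, sigma_0^2) as in (3.3) and (3.4) for a diagonal Omega."""
    n = len(mu)
    beta0 = ols_coefficients(X, mu)
    fitted = [sum(x * b for x, b in zip(row, beta0)) for row in X]
    perp_sq = sum((m - f) ** 2 for m, f in zip(mu, fitted))  # mu'(I - P)mu
    sigma0_sq = (perp_sq + sum(variances)) / n               # tr(Omega) = sum of variances
    return beta0, sigma0_sq

# True means are quadratic in x, but the model only has an intercept and a slope.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
X = [[1.0, x] for x in xs]
mu = [1.0 + x * x for x in xs]
beta0, sigma0_sq = pseudo_true(X, mu, variances=[1.0] * 5)
print(beta0, sigma0_sq)
```

Here the misspecified straight line has pseudo-true intercept 3 and slope 0, and the neglected curvature inflates $\sigma_0^2$ from the true variance 1 to 3.8.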
The maximum likelihood (ML) estimates

(3.7)  $\hat\beta = (X'X)^{-1} X'y, \quad \hat\sigma^2 = \frac{1}{n}\, y'[I - X(X'X)^{-1}X']y$

for $\beta$ and $\sigma^2$ in the normal regression model (3.2) have the following
property.

Lemma 3.2:

(3.8)  $E(\hat\beta) = \beta_0$,

(3.9)  $\operatorname{plim}\,(\hat\sigma^2 - \sigma_0^2) = 0$, if $\Omega = \omega^2 I_n$.

This lemma implies that with an incorrect model the objects of our
estimation are pseudo-true parameter values. To put it differently,
what we ordinarily call the true parameter values are the parameter
values that minimize the distance between the true unknown distribution
and the postulated parametric model, where the distance is measured
by the KLIC. Moreover, it should be noted that if the $Y_i$'s are uncorrelated,
i.e., $\Omega = \omega^2 I_n$, then $\hat\beta$ and $\hat\sigma^2$ are uncorrelated.

Along the lines of the previous section, one can
measure the loss incurred by modelling $G(y)$ by $F(y|\theta)$ with some
estimate $\hat\theta$ in place of the unknown $\theta$ by the quantity

(3.10)  $W(F(\cdot|\hat\theta)) = -\frac{2}{n}\, E_G[\log f(Y|\hat\theta) \mid \hat\theta]$,

where $f(y|\theta)$ is the density function of the pseudo-true model
$N(X\beta_0, \sigma_0^2 I)$, i.e., the likelihood function of the model. It should
be noted that the expectation on the right-hand side of (3.10) refers
only to the argument $Y$ of the density function.
Lemma 3.3:
The loss incurred by modelling the distribution of $Y$ by $F(y|\theta)$ with an
estimated value $\hat\theta$ substituted for $\theta$ is evaluated as

(3.11)  $W(F(\cdot|\hat\theta)) = \log(2\pi) + \log(\hat\sigma^2) + \frac{\sigma_0^2}{\hat\sigma^2} + \frac{1}{n\hat\sigma^2}\, \| X(\hat\beta - \beta_0) \|^2$

where $\|\cdot\|$ is the Euclidean norm.

The proof is given in the Appendix.
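
For orientation, the identity (3.11) can be traced in three lines. This is only a sketch that assumes the normal density of (3.2) and the pseudo-true values of Lemma 3.1; it is not a reproduction of the Appendix proof.

```latex
\begin{align*}
-\frac{2}{n}\log f(y|\hat\theta)
  &= \log(2\pi) + \log\hat\sigma^2
     + \frac{1}{n\hat\sigma^2}\,\|y - X\hat\beta\|^2 ,\\
E_G\bigl[\|Y - X\hat\beta\|^2 \,\big|\, \hat\theta\bigr]
  &= \operatorname{tr}\Omega + \|\mu - X\hat\beta\|^2
   = \operatorname{tr}\Omega + \|\mu - X\beta_0\|^2 + \|X(\hat\beta - \beta_0)\|^2
   = n\sigma_0^2 + \|X(\hat\beta - \beta_0)\|^2 ,\\
W(F(\cdot|\hat\theta))
  &= \log(2\pi) + \log\hat\sigma^2 + \frac{\sigma_0^2}{\hat\sigma^2}
     + \frac{1}{n\hat\sigma^2}\,\|X(\hat\beta - \beta_0)\|^2 .
\end{align*}
```

The cross term in the second line vanishes because $\mu - X\beta_0$ is orthogonal to the column space of $X$, and the last equality in that line uses $n\sigma_0^2 = \operatorname{tr}\Omega + \|\mu - X\beta_0\|^2$ from (3.4).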

In this section we adhere to the sampling theory approach, and
hence we base our decision about model selection on the risk function
derived by integrating the loss function with respect to the sampling
distribution of the estimate $\hat\theta$. Since the ML estimate $\hat\theta$ possesses the
nice property in Lemma 3.2, even when a postulated model is incorrect, we
define the risk of postulating a model $F(y|\theta)$ by an integral of the loss
function of $F(y|\hat\theta)$ with respect to the sampling distribution of the ML
estimate $\hat\theta$.


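Theorem 3.1: Suppose that $\Omega = \omega^2 I_n$ and each $Y_i$ is symmetrically
distributed with the same kurtosis as a normal distribution. Then the
risk of a model $F(\cdot|\theta)$, i.e., the expected value of $W(F(\cdot|\hat\theta))$, is
evaluated to order $O(1/n)$ as

(3.12)  $R(F(\cdot|\theta)) = \log(2\pi) + \log(\sigma_0^2) + 1 + \frac{k+2}{n}\left(\frac{\omega^2}{\sigma_0^2}\right) - \frac{1}{n}\left(\frac{\omega^2}{\sigma_0^2}\right)^2 + O(n^{-2})$.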

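The proof is given in the Appendix. It should be noted that $\sigma_0^2$ does not
increase, and typically decreases, with the addition of explanatory
variables, i.e., with an increase in $k$, because by (3.4) enlarging the
space spanned by the columns of $X$ can only shorten the perpendicular
from $\mu$ to that space.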

To develop a practical and useful criterion for model
identification, the risk function involving unknown parameters needs
to be somehow estimated from a given set of observations.

Theorem 3.2: Suppose that an asymptotically unbiased estimate, say
$\hat\omega^2$, for $\omega^2$ is obtained from some source available a priori and is
statistically independent of $\hat\sigma^2$. Then

(3.13)  $\mathrm{BIC}(F(\cdot|\theta)) = -2 \log f(y|\hat\theta) + 2(k + 2)\left(\frac{\hat\omega^2}{\hat\sigma^2}\right) - 2\left(\frac{\hat\omega^2}{\hat\sigma^2}\right)^2$

is an asymptotically unbiased estimate of $nR(F(\cdot|\theta))$.

The proof is given in the Appendix. If we equate $\hat\omega^2$ to $\hat\sigma^2$,
the BIC is identical with the AIC. As was pointed out in the preceding
section, the AIC is based on the assumption that the true distribution
belongs to the family of distributions specified by a postulated model;
namely, $\sigma_0^2$ is equated to $\omega^2$ in the process of deriving the AIC.

The variance ratio $\omega^2/\sigma_0^2$ increases with successive addition
of explanatory variables, and possibly it approaches one. Its
reciprocal $\sigma_0^2/\omega^2$ ($\ge 1$) may be interpreted as a discounting factor for
the penalty that has to be paid for increasing the number of
parameters. Therefore, when we compare two regression models, one
with fewer explanatory variables and poorer fit, the other
with more explanatory variables and better fit, the BIC is more
favorable to the more parsimonious model than the AIC. The following
numerical evaluations show that the difference between the two criteria
is far from negligible.
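
The relation between the two criteria can be made concrete with a short sketch of (3.13) for the normal regression model. It assumes the maximized log likelihood $-2\log f(y|\hat\theta) = n(\log 2\pi + \log\hat\sigma^2 + 1)$ and reads the $k$ of (2.10) as the $k$ regression coefficients plus the variance; the function names and the numerical inputs are hypothetical.

```python
import math

def minus2_max_log_lik(n, sigma2_hat):
    """-2 log f(y|theta_hat) for a normal regression with ML variance sigma2_hat."""
    return n * (math.log(2.0 * math.pi) + math.log(sigma2_hat) + 1.0)

def bic_313(n, k, sigma2_hat, omega2_hat):
    """Criterion (3.13) with r = omega2_hat / sigma2_hat."""
    r = omega2_hat / sigma2_hat
    return minus2_max_log_lik(n, sigma2_hat) + 2.0 * (k + 2) * r - 2.0 * r * r

def aic_regression(n, k, sigma2_hat):
    """AIC (2.10), counting k coefficients plus the variance as parameters."""
    return minus2_max_log_lik(n, sigma2_hat) + 2.0 * (k + 1)

# Equating omega2_hat to sigma2_hat reproduces the AIC.
print(bic_313(30, 3, 2.0, 2.0), aic_regression(30, 3, 2.0))

# When the model's variance estimate exceeds omega2_hat, the penalty is discounted.
print(bic_313(30, 3, 2.0, 1.0), aic_regression(30, 3, 2.0))
```

In the second call the penalty is $2 \cdot 5 \cdot 0.5 - 2 \cdot 0.25 = 4.5$ instead of 8, which is the sense in which (3.13) is more lenient toward a small, poorly fitting model than the AIC.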

Let us develop a decision rule to choose one from two
alternative regression models

(3.14)  $F_1$: $Y \sim N(X_1\beta_1, \sigma_1^2 I_n)$
        $F_2$: $Y \sim N(X_1\beta_1 + X_2\beta_2, \sigma_2^2 I_n)$

where $X_1$ and $X_2$ are respectively $n \times p$ and $n \times q$ matrices of known
constants, $\beta_1$ and $\beta_2$ are respectively $p \times 1$ and $q \times 1$ vectors of
unknown parameters, and $\sigma_1^2$ and $\sigma_2^2$ are positive unknowns. The true
distribution is assumed to be $N(\mu, \omega^2 I_n)$. In practice, we cannot
expect to obtain an estimate for $\omega^2$ from some independent source.
Therefore, assuming that the more complicated model $F_2$ is nearly
true, we substitute the ML estimate $\hat\sigma_2^2$ of $\sigma_2^2$ for $\hat\omega^2$ in (3.13). Our
decision rule is described as follows: we choose $F_1$ if $\mathrm{BIC}(F_1) <
\mathrm{BIC}(F_2)$ and vice versa.

It is straightforward to show that the decision rule based
on the BIC is equivalent to a decision based on the magnitude of the
F-statistic that is customarily used to test the null hypothesis
$\beta_2 = 0$. That is, we decide to choose $F_1$ if an observed value of the
F-statistic falls below a critical point determined by the inequality
$\mathrm{BIC}(F_1) < \mathrm{BIC}(F_2)$, and choose $F_2$ otherwise. The critical point
varies depending on $n$, $p$, and $q$.
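
The critical points can be recomputed directly from the two inequalities. The sketch below treats the case $q = 1$, writes $x = \hat\sigma_1^2/\hat\sigma_2^2 - 1 = t^2/(n - p - 1)$, and uses $\hat\sigma_2^2$ in place of $\hat\omega^2$ as described above; the MBIC threshold is found by bisection and the MAIC threshold in closed form. The function names and the bisection tolerance are assumptions of this example, and counting the variance as a parameter in the AIC is my reading of (2.10).

```python
import math

def maic_critical_t(n, p):
    """|t| threshold implied by AIC(F1) < AIC(F2) when q = 1."""
    return math.sqrt((n - p - 1) * (math.exp(2.0 / n) - 1.0))

def mbic_critical_t(n, p, tol=1e-12):
    """|t| threshold implied by BIC(F1) < BIC(F2) when q = 1 (bisection)."""
    df = n - p - 1

    def gap(x):
        # BIC(F1) - BIC(F2) as a function of x = sigma1_hat^2 / sigma2_hat^2 - 1.
        r = 1.0 / (1.0 + x)
        return n * math.log(1.0 + x) + 2.0 * (p + 2) * (r - 1.0) - 2.0 * r * r

    lo, hi = 0.0, 1.0          # gap(0) = -2 < 0, so F1 is chosen when x is near 0
    while gap(hi) <= 0.0:      # expand until the sign changes
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return math.sqrt(df * 0.5 * (lo + hi))

for n in (10, 30, 100, 1000):
    print(n, round(mbic_critical_t(n, 1), 3), round(maic_critical_t(n, 1), 3))
```

For $n = 10$ and one variable already included, this gives values that round to 1.525 and 1.331, the leading entries of Tables 3.1 and 3.2.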

Confining ourselves to the case when $q = 1$, we tabulate the
critical points implied by the minimum BIC principle, say MBIC critical
points, in Table 3.1. Since the t-statistic is more familiar to us
than the F-statistic in the case of $q = 1$, these critical values refer to
the t-statistic. We decide to choose $F_1$ if the observed value of
the t-statistic, the ML estimate of $\beta_2$ divided by its estimated
standard deviation, falls below a critical point read from the table,
and vice versa.

To examine how much the MBIC procedure differs from the
MAIC procedure, the critical points implied by the AIC are also
tabulated in Table 3.2. Both of these approach $\sqrt{2}$ asymptotically,
although very slowly. We note a remarkable difference, namely that the
MAIC critical point approaches $\sqrt{2}$ from below whereas the MBIC critical
point approaches it from above. Moreover, as the number of variables
already included increases, i.e., as $p$ becomes larger, the MBIC procedure
increasingly discriminates against the inclusion of additional variables,
whereas the converse is true for MAIC.

To see a connection between our procedure and the preliminary
t-test, for some chosen cases, we tabulate the level of significance,
i.e., the probability that $|t|$ exceeds the critical point when $F_1$ is
true. Roughly speaking, for moderate values of $p$, the significance
level for the MAIC procedure varies over the wide range from 30% to 16%
as the number of degrees of freedom increases; on the other hand, for
the MBIC procedure, it varies over a relatively narrow range from 10%
to 16%. Both procedures share a common property in their more generous
attitude toward inclusion of additional variables than the traditional
preliminary test with the significance level 5% or 10%. It should be
noted, however, that these two asymptotically equivalent procedures
will very often lead us to different decisions for small samples.
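
The common limit of the two significance levels follows from the limiting critical point $\sqrt{2}$ and the normal limit of the t-distribution. The short check below is only a sketch of that limiting case; the function name is hypothetical, and the finite-sample levels in the tables would require the t-distribution, which is not computed here.

```python
import math

def two_sided_normal_tail(c):
    """P(|Z| > c) for a standard normal variable Z."""
    return math.erfc(c / math.sqrt(2.0))

# Limiting significance level shared by the MAIC and MBIC procedures.
limit_level = two_sided_normal_tail(math.sqrt(2.0))
print(limit_level)
```

The result is about 0.157, in line with the limiting rows of Tables 3.3 and 3.4.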

Based on the minimax regret principle with the squared error
of prediction as a loss function, Sawa and Hiromatsu [6] calculated the
optimal significance point for the preliminary t-test. Their minimax
regret significance points are quite insensitive to the change in the
number of degrees of freedom. That is, it remains constant at 1.37 to
two decimal places, unless the number of degrees of freedom is extremely
small, say less than 10. Indeed it is difficult to establish a clear-cut
connection between the two basically different approaches, but it would
be worth noting that if a loss function is specified in terms of the
prediction error, the more prodigal model is likely to be preferred.

We often encounter a situation where we have to choose one
of two unnested alternatives:

$Y \sim N(X_1\beta_1, \sigma_1^2 I_n)$ and $Y \sim N(X_2\beta_2, \sigma_2^2 I_n)$,

where the true distribution of $Y$ is $N(\mu, \omega^2 I_n)$. In this kind of
situation the unknown true variance $\omega^2$ may be reasonably estimated from
a regression of $y$ on all the explanatory variables $X_1 \cup X_2$. Another
reasonable estimate of $\omega^2$ may be the smallest value of the "unbiased"
estimates of variance over all possible regressions of $y$ on a subset
of $X_1 \cup X_2$.

The difficulty in estimating $\omega^2$ does admittedly place a
serious limitation on the practical usefulness of the MBIC procedure.
However, it should be noted that the same difficulty is shared by
Mallows' [4] procedure, which is based on what he calls the $C_p$ statistic.
Incidentally, Mallows' procedure gives a decision rule essentially
similar to the AIC. It is also worth noting that according to Akaike's
procedure $\omega^2$ is estimated by $\hat\sigma_1^2$ when we evaluate the AIC for the model
$F_1$ and by $\hat\sigma_2^2$ when we evaluate the AIC for the model $F_2$. This means
that, given a class of nested alternative models, the AIC for each
model is evaluated assuming it is true. On the other hand, the BIC
for each model is evaluated assuming that the most complex model within
the class would be true.




Table 3.1
MBIC Critical Points for the Preliminary t-Test

   n \ p       1       2       3       4       5      10
     10    1.525   1.646   1.816   2.036   2.264       -
     12    1.500   1.591   1.715   1.882   2.092       -
     14    1.484   1.557   1.652   1.778   1.943   2.678
     16    1.473   1.533   1.610   1.709   1.836   2.758
     18    1.465   1.516   1.580   1.660   1.761   2.665
     20    1.458   1.504   1.558   1.625   1.707   2.494
     25    1.448   1.482   1.522   1.568   1.624   2.192
     30    1.442   1.469   1.500   1.536   1.576   1.912
     50    1.430   1.445   1.462   1.480   1.449   1.625
    100    1.442   1.429   1.437   1.445   1.453   1.499
    200    1.418   1.421   1.425   1.429   1.433   1.453
    500    1.416   1.417   1.419   1.420   1.421   1.429
   1000    1.415   1.416   1.416   1.417   1.418   1.421

n is the sample size and p is the number of the explanatory variables
already included in the model. The decision rule is described as follows:
if the t-value for an optional variable exceeds the MBIC critical point, we
decide to augment the model by the optional variable, and vice versa. Note
that the MBIC critical point approaches $\sqrt{2}$ slowly as n tends to infinity
for every p.

Table 3.2
MAIC Critical Points for the Preliminary t-Test

   n \ p       1       2       3       4       5      10
     10    1.331   1.245   1.153   1.052    .941       -
     12    1.346   1.278   1.205   1.127   1.043    .426
     14    1.357   1.300   1.239   1.176   1.108    .679
     16    1.365   1.316   1.264   1.210   1.154    .816
     18    1.371   1.328   1.283   1.236   1.188    .907
     20    1.376   1.337   1.297   1.256   1.213    .973
     25    1.384   1.354   1.323   1.291   1.258   1.000
     30    1.389   1.364   1.339   1.313   1.286   1.144
     50    1.400   1.385   1.370   1.355   1.340   1.262
    100    1.407   1.400   1.393   1.385   1.378   1.341
    200    1.411   1.407   1.404   1.400   1.396   1.378
    500    1.413   1.411   1.410   1.409   1.407   1.400
   1000    1.414   1.413   1.412   1.411   1.411   1.407

n is the sample size and p is the number of the explanatory variables
already included in the model. The decision rule is described as follows: if
the t-value for an optional variable exceeds the MAIC critical point, we decide
to augment the model by the optional variable, and vice versa. Note that the
MAIC critical point approaches $\sqrt{2}$ slowly as n tends to infinity for every p.

Table 3.3  Significance Levels Implied by the BIC Procedure

   n \ p        1        2        3        4        5
     10     .1658    .1438    .1193    .0974    .0863
     12     .1645    .1461    .1247    .1019    .0814
     14     .1636    .1478    .1295    .1091    .0879
     16     .1629    .1492    .1334    .1155    .0962
     18     .1625    .1503    .1364    .1208    .1037
     20     .1621    .1509    .1388    .1250    .1099
     25     .1611    .1525    .1430    .1326    .1209
     30     .1604    .1534    .1457    .1371    .1281
     50     .1592    .1551    .1505    .1458    .1544
    inf     .1574    .1574    .1574    .1574    .1574

Table 3.4  Significance Levels Implied by the AIC Procedure

   n \ p        1        2        3        4        5
     10     .2199    .2532    .2928    .3410    .4000
     12     .2080    .2332    .2626    .2969    .3371
     14     .1998    .2202    .2436    .2698    .3001
     16     .1938    .2109    .2302    .2516    .2753
     18     .1893    .2040    .2203    .2383    .2578
     20     .1857    .1988    .2130    .2283    .2452
     25     .1796    .1895    .2001    .2114    .2236
     30     .1758    .1838    .1922    .2011    .2107
     50     .1679    .1726    .1773    .1822    .1871
    inf     .1574    .1574    .1574    .1574    .1574

4.   Bayesian Decision Rule

In this section we look at the problem another way, from the
Bayesian point of view. Given a model $F(\cdot|\theta)$ coupled with a prior
distribution $P(\theta)$ we define the Bayes risk, say $B(\hat\theta|F)$, for an estimate
$\hat\theta$ by the expectation of the loss function (3.10) with respect to the
posterior distribution, that is,

(4.1)  $B(\hat\theta|F) = \int W(F(\cdot|\hat\theta))\, dP(\theta|y)$

where $P(\theta|y)$ is the posterior distribution for $\theta$ given an observation
$y$. If there exists an estimate $\theta^*$ such that

(4.2)  $B(\theta^*|F) = \min_{\hat\theta} B(\hat\theta|F)$

then it is called the Bayes estimate of $\theta$. Recalling that $W(F(\cdot|\hat\theta))$
measures the discrepancy of a model $F(\cdot|\theta)$ from the true distribution
$G(\cdot)$, we take $B(\theta^*|F)$ as a measure of the adequacy of a model $F(\cdot|\theta)$
associated with a prior distribution $P(\theta)$. That is, along the lines of
previous sections, if we compare two alternative models, say, $F_1(\cdot|\theta)$
with $P_1(\theta)$ and $F_2(\cdot|\zeta)$ with $P_2(\zeta)$, then we decide to choose $F_1$ or $F_2$
according to whether or not $B(\theta^*|F_1) < B(\zeta^*|F_2)$.

In what follows let us be specific to a linear normal regression
model for a vector random variable $Y$:

(4.3)  $F$: $Y \sim N(X\beta, \sigma^2 I_n)$

where $Y$ is $n \times 1$, $X$ is $n \times k$, $\beta$ is $k \times 1$, and $\mu$ is $n \times 1$; the true
distribution of $Y$ is $N(\mu, \omega^2 I_n)$ with unknowns $\mu$ and $\omega^2$. If we assume
a diffuse prior for $\beta$ and $\sigma^2$, the minimum attainable Bayes risk is
evaluated as follows:


Lemma 4.1. Given a model $F$ with a diffuse prior, the minimum attainable
Bayes risk is

(4.4)  $B(\beta^*, \sigma^{2*}|F) = -\frac{2}{n} \log f(y|\hat\beta, \hat\sigma^2) + \log\left(\frac{n + k}{n - k - 2}\right)$,

where $\hat\beta$ and $\hat\sigma^2$ are the ML estimates for $\beta$ and $\sigma^2$, $\beta^*$ and $\sigma^{2*}$ are the
Bayes estimates, and $f$ is the density function of $N(X\beta, \sigma^2 I_n)$.

Let us make a comparison of two nested alternatives $F_1$ and $F_2$
given in (3.14). The Bayes decision rule, based on the magnitude of the
minimum attainable Bayes risk, leads us to the following decision rule
which is again described in terms of a familiar F-statistic.


Theorem  4.1.  A  decision  rule  based  on  the  rainimiam  attainable  Bayes  risk 
is  equivalent  to:   choose  F   if 

(4  5)     w  <  2(n  -  l)(n  -  p  -  q) 

^""'^^  (n  +  p)(n  -  p  -  q  -  2) 


choose  F-   otherwise,   vihere 

"2        "2 
(n  -  p  -  q)  (o        -  a,   ) 

(4.6)  W-  :r-7r^ ^~ 


is  a  F-statistic  conventionally  employed  to  test  the  hypothesis  that 
^2  =  0. 

The  proof  is  given  in  the  Appendix. 
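
The equivalence stated in Theorem 4.1 can be checked numerically. The following is a minimal Python sketch, not part of the original paper: it evaluates the minimum attainable Bayes risk (4.4) for each model and compares the resulting choice with the F-statistic rule (4.5)-(4.6). The function names and the sample variance values are hypothetical, and the sketch assumes that $F_1$ has $p$ regressors and $F_2$ has $p + q$.

```python
import math


def min_bayes_risk(sigma_hat_sq: float, n: int, k: int) -> float:
    """Minimum attainable Bayes risk (4.4) for a model with k regressors.

    Uses -(2/n) log f(y | ML estimates) = log(2*pi) + log(sigma_hat_sq) + 1.
    """
    if n - k - 2 <= 0:
        raise ValueError("requires n - k > 2")
    return (math.log(2.0 * math.pi) + math.log(sigma_hat_sq) + 1.0
            + math.log((n + k) / (n - k - 2)))


def f_statistic(s1_sq: float, s2_sq: float, n: int, p: int, q: int) -> float:
    """F-statistic (4.6) from the ML variance estimates of F_1 and F_2."""
    return (n - p - q) * (s1_sq - s2_sq) / (q * s2_sq)


def bayes_critical_point(n: int, p: int, q: int) -> float:
    """Right-hand side of (4.5)."""
    return 2.0 * (n - 1) * (n - p - q) / ((n + p) * (n - p - q - 2))


def choose_by_risk(s1_sq: float, s2_sq: float, n: int, p: int, q: int) -> int:
    """Return 1 if F_1 has the smaller minimum Bayes risk, else 2."""
    b1 = min_bayes_risk(s1_sq, n, p)
    b2 = min_bayes_risk(s2_sq, n, p + q)
    return 1 if b1 < b2 else 2


def choose_by_f(s1_sq: float, s2_sq: float, n: int, p: int, q: int) -> int:
    """Return 1 if W falls below the Bayes critical point, else 2."""
    w = f_statistic(s1_sq, s2_sq, n, p, q)
    return 1 if w < bayes_critical_point(n, p, q) else 2


if __name__ == "__main__":
    # Hypothetical ML variance estimates for illustration only.
    for s1_sq in (1.05, 1.20):
        print(s1_sq, choose_by_risk(s1_sq, 1.0, 20, 2, 1),
              choose_by_f(s1_sq, 1.0, 20, 2, 1))
```

Both functions should return the same choice for any admissible inputs, since the two rules differ only by a monotone transformation.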
We call the right-hand side of (4.5) the Bayes critical point.
It tends to 2 asymptotically, increases with q, and decreases with p
if n is moderately large. Limiting ourselves to the case of q = 1,
we tabulate the numerical values of the square root of the Bayes
critical point in Table 4.1, which is comparable to Tables 3.1 and 3.2.

Table 4.1
Bayes Critical Points for the Preliminary t-Test

   n \ p      1       2       3       4       5

     10     1.477   1.449   1.441   1.464   1.549
     12     1.454   1.421   1.398   1.387   1.393
     14     1.442   1.409   1.383   1.363   1.351
     16     1.435   1.403   1.376   1.354   1.336
     18     1.430   1.401   1.374   1.351   1.332
     20     1.427   1.399   1.374   1.352   1.332
     25     1.422   1.398   1.376   1.356   1.337
     30     1.419   1.399   1.380   1.362   1.345
     50     1.416   1.403   1.390   1.378   1.366
    100     1.415   1.408   1.401   1.395   1.388
    200     1.414   1.411   1.407   1.404   1.401
   1000     1.414   1.414   1.413   1.412   1.411
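
The entries of Table 4.1 follow directly from (4.5) with q = 1. As a small illustration (a sketch added here, with hypothetical function names), the rows can be recomputed in Python as the square root of the Bayes critical point:

```python
import math


def bayes_critical_point(n: int, p: int, q: int = 1) -> float:
    """Right-hand side of (4.5); requires n - p - q > 2."""
    return 2.0 * (n - 1) * (n - p - q) / ((n + p) * (n - p - q - 2))


def table_row(n: int, ps=(1, 2, 3, 4, 5)) -> list:
    """Square roots of the Bayes critical point for q = 1, rounded to 3 places."""
    return [round(math.sqrt(bayes_critical_point(n, p)), 3) for p in ps]


if __name__ == "__main__":
    for n in (10, 20, 100, 1000):
        print(n, table_row(n))
```

For large n the values approach the square root of 2, in line with the remark that the critical point tends to 2 asymptotically.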

It is interesting to note that the Bayes critical point varies
very little with changes in the values of n and p. Also,
it is very close to the minimax regret critical point in Sawa and
Hiromatsu [6].

5.   Bias of Decision Rules

Now we return to Section 3 and reconsider the problem from the
viewpoint of sampling theory. When we compare the two nested
alternative models given in (3.14), our decision rule should in
principle be based on the risk function given in Theorem 3.1. That is, we
should choose $F_1$ if $R(F_1(\cdot \mid \hat\theta_1)) < R(F_2(\cdot \mid \hat\theta_2))$ and vice versa.

Lemma 5.1.  If $\delta = \sigma_{10}^2 - \sigma_{20}^2 = O(n^{-1})$, then

(5.1)  $R(F_1(\cdot \mid \hat\theta_1)) - R(F_2(\cdot \mid \hat\theta_2)) = \frac{\delta}{\sigma_{20}^2} - \frac{q \, \omega^2}{n \, \sigma_{20}^2} + O(n^{-2})$.

The proof is given in the Appendix. It should be recalled that when
we derived the BIC the terms of $O(n^{-2})$ were neglected. It is,
therefore, consistent that we evaluate the difference of risk only to
order $O(n^{-1})$. The difference between the pseudo-variances, $\delta$, is
assumed to be $O(n^{-1})$. This assumption may seem somewhat
uncomfortable. However, it may be justified by the fact that the
model discrimination procedure would be unnecessary unless the difference
between the two alternatives were as small as the reciprocal of the
sample size.

Hence we can legitimately define a correct decision rule as
follows: choose the model $F_1$ if $n\delta/\omega^2 < q$ and choose $F_2$ if
$n\delta/\omega^2 > q$.

Based on the preceding consideration, we introduce the
notion of unbiasedness of a decision rule: a decision rule is said to
be unbiased if the probability of choosing $F_1$ is greater than 1/2 when
$n\delta/\omega^2 < q$ and less than 1/2 when $n\delta/\omega^2 > q$. If the probability decreases
continuously with the increase of $n\delta/\omega^2$, the condition of unbiasedness
is simply described as follows: the probability of choosing $F_1$ (or $F_2$)
is 1/2 when $n\delta/\omega^2 = q$. Note that when $n\delta/\omega^2 = q$ we are indifferent to
the two alternative models. If the above probability exceeds 1/2, then
the decision rule is said to be biased toward a simpler model; if it
falls below 1/2, then the decision rule is biased toward a more complex
model.

All decision rules considered so far are based on whether or
not an observed value of $W$, given by (4.6), exceeds a constant which
changes with n, p, and q. Under the assumption that $Y \sim N(\mu, \omega^2 I_n)$, $W$
is distributed as a doubly noncentral F with (q, n-p-q) degrees of
freedom and the noncentrality parameters

(5.2)  $\delta_1 = \frac{n\delta}{\omega^2} = \frac{\mu' X_2^* (X_2^{*\prime} X_2^*)^{-1} X_2^{*\prime} \mu}{\omega^2}$    and

(5.3)  $\delta_2 = \frac{\mu' [I - X_1 (X_1' X_1)^{-1} X_1' - X_2^* (X_2^{*\prime} X_2^*)^{-1} X_2^{*\prime}] \mu}{\omega^2}$

where $X_2^* = X_2 - X_1 (X_1' X_1)^{-1} X_1' X_2$. It would be worth noting here that a
decision is correct if we decide to choose $F_1$ when the noncentrality
parameter of the numerator is less than its degree of freedom and
vice versa.
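
To make (5.2) and (5.3) concrete, here is a minimal pure-Python sketch that computes the two noncentrality parameters by Gram-Schmidt orthogonalization instead of explicit matrix inverses. It is an illustration under assumptions, not the paper's procedure: the function names are hypothetical, the columns of $X_1$ and $X_2$ are passed as lists of vectors, and $\mu$ and $\omega^2$ are treated as known.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(columns: Sequence[Vector], basis: List[Vector]) -> List[Vector]:
    """Gram-Schmidt: residualize each column against `basis` and the vectors
    already produced, and return the new orthonormal vectors."""
    new: List[Vector] = []
    for col in columns:
        v = list(col)
        for u in basis + new:
            c = dot(v, u)
            v = [vi - c * ui for vi, ui in zip(v, u)]
        norm = dot(v, v) ** 0.5
        if norm < 1e-12:
            raise ValueError("columns are linearly dependent")
        new.append([vi / norm for vi in v])
    return new


def noncentralities(mu: Vector, x1_cols: Sequence[Vector],
                    x2_cols: Sequence[Vector], omega_sq: float) -> Tuple[float, float]:
    """Return (delta_1, delta_2) of (5.2) and (5.3)."""
    u1 = orthonormalize(x1_cols, [])   # orthonormal basis of the span of X_1
    u2 = orthonormalize(x2_cols, u1)   # orthonormal basis of the span of X_2^*
    proj1 = sum(dot(mu, u) ** 2 for u in u1)
    proj2 = sum(dot(mu, u) ** 2 for u in u2)
    delta1 = proj2 / omega_sq
    delta2 = (dot(mu, mu) - proj1 - proj2) / omega_sq
    return delta1, delta2


if __name__ == "__main__":
    # Hypothetical toy design: intercept in X_1, a linear trend in X_2.
    print(noncentralities([1.0, 2.0, 4.0], [[1.0, 1.0, 1.0]],
                          [[1.0, 2.0, 3.0]], omega_sq=1.0))
```

With these quantities, choosing $F_1$ is the correct decision exactly when `delta1` is less than q.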

In Table 5.1 we tabulate the probability that $W$ falls below the
BIC critical point when $n\delta/\omega^2 = q$, i.e., when $F_1$ and $F_2$ are indifferent.
It can be observed from the Table that the BIC procedure is considerably
biased toward a simpler model.

Table 5.1
Bias of the BIC Decision Rule

          noncentrality   n = 10   n = 20   n = 30   n = 40   n = 50

  p = 2        .0          .696     .671     .664     .661     .659
               .1          .720     .697     .690     .687     .685
               .2          .742     .720     .714     .711     .709
               .3          .763     .742     .736     .733     .731
               .4          .781     .762     .756     .753     .752
               .5          .798     .780     .774     .772     .770
               .6          .814     .797     .791     .789     .788
               .7          .829     .812     .807     .805     .803
               .8          .842     .827     .822     .820     .818
               .9          .854     .840     .835     .833     .832
              1.0          .866     .852     .848     .846     .844

  p = 3        .0          .738     .689     .675     .669     .666
               .1          .760     .715     .701     .695     .692
               .2          .781     .738     .725     .719     .715
               .3          .800     .759     .747     .741     .737
               .4          .817     .779     .767     .761     .758
               .5          .833     .797     .785     .779     .776
               .6          .848     .813     .802     .796     .793
               .7          .861     .828     .817     .812     .809
               .8          .873     .842     .832     .827     .824
               .9          .884     .855     .845     .840     .837
              1.0          .894     .866     .857     .852     .850

Each entry in the table is the probability that a doubly noncentral F
variate, with noncentrality parameters $(\delta_1, \delta_2)$ and $(1, n - p - 1)$
degrees of freedom, falls below the BIC critical point when $\delta_1 = 1$. The
noncentrality is $\delta_2/(n - p - 1)$, i.e., the normalized noncentrality
parameter of the denominator in F, where $\delta_2$ is given by (5.3).
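
A probability of this kind can be approximated by simulation. The sketch below is an assumption-laden illustration rather than the method used for Table 5.1: it draws a doubly noncentral F variate as a ratio of two noncentral chi-squares and estimates the probability of falling below a critical point supplied by the caller. Since the BIC critical point is defined in Section 3 and is not reproduced here, the usage example substitutes the Bayes critical point (4.5), so its output is not expected to reproduce the table. All function names, the seed, and the number of draws are hypothetical.

```python
import random


def noncentral_chi2(df: int, nc: float, rng: random.Random) -> float:
    """Draw a noncentral chi-square with df degrees of freedom and
    noncentrality nc by shifting one of df standard normals by sqrt(nc)."""
    total = (rng.gauss(0.0, 1.0) + nc ** 0.5) ** 2
    for _ in range(df - 1):
        total += rng.gauss(0.0, 1.0) ** 2
    return total


def prob_below(critical: float, n: int, p: int, q: int,
               delta1: float, delta2: float,
               draws: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(W < critical) for a doubly noncentral F
    with (q, n - p - q) degrees of freedom."""
    rng = random.Random(seed)
    df2 = n - p - q
    hits = 0
    for _ in range(draws):
        num = noncentral_chi2(q, delta1, rng) / q
        den = noncentral_chi2(df2, delta2, rng) / df2
        if num / den < critical:
            hits += 1
    return hits / draws


if __name__ == "__main__":
    n, p, q = 10, 2, 1
    # Bayes critical point (4.5), used here only as a stand-in.
    critical = 2.0 * (n - 1) * (n - p - q) / ((n + p) * (n - p - q - 2))
    for t in (0.0, 0.5, 1.0):          # normalized denominator noncentrality
        delta2 = t * (n - p - q)
        print(t, prob_below(critical, n, p, q, delta1=float(q), delta2=delta2))
```

An estimate above 1/2 at the indifference point $\delta_1 = q$ indicates a bias toward the simpler model in the sense defined above.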
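Appendix

Proof of Lemma 3.1

The log likelihood function is

(A.1)  $\log f(y \mid \theta) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \| y - X\beta \|^2$

where $\theta' = (\beta', \sigma^2)$ and $\| \cdot \|$ stands for the Euclidean norm.
Differentiating it with respect to $\beta$ and $\sigma^2$, we have

(A.2)  $\frac{\partial \log f(y \mid \theta)}{\partial \beta} = \frac{1}{\sigma^2} X'(y - X\beta)$,

(A.3)  $\frac{\partial \log f(y \mid \theta)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \| y - X\beta \|^2$.

Then

(A.4)  $E\left[\frac{\partial \log f(Y \mid \theta)}{\partial \beta}\right] = \frac{1}{\sigma^2} X'(\mu - X\beta)$,

(A.5)  $E\left[\frac{\partial \log f(Y \mid \theta)}{\partial \sigma^2}\right] = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} E\| Y - X\beta \|^2$
       $= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \left( E\| Y - \mu \|^2 + \| \mu - X\beta \|^2 \right)$
       $= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \left( \operatorname{tr} \Omega + \| \mu - X\beta \|^2 \right)$.

Equating (A.4) and (A.5) to zero and solving them yields the
pseudo-true parameter values $\beta_0$ and $\sigma_0^2$ given, respectively, by (3.3)
and (3.4).

Proof of Lemma 3.2

(A.6)  $E(\hat\beta) = (X'X)^{-1} X'\mu = \beta_0$

(A.7)  $E(\hat\sigma^2) = \frac{1}{n} \operatorname{tr} P_X (\mu\mu' + \omega^2 I_n) = \frac{n - k}{n} \omega^2 + \frac{1}{n} \mu' P_X \mu$

where $P_X = I - X(X'X)^{-1}X'$. Then

(A.8)  $\lim_{n \to \infty} E(\hat\sigma^2) = \lim_{n \to \infty} \sigma_0^2$.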

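Proof of Lemma 3.3

From (A.1) we have

(A.9)  $-\frac{2}{n} \log f(Y \mid \hat\theta) = \log(2\pi) + \log \hat\sigma^2 + \frac{1}{n\hat\sigma^2} \| Y - X\hat\beta \|^2$

where $Y$ is a vector random variable independent of $\hat\theta$. Taking expectation
of (A.9) and substituting

(A.10)  $E[\| Y - X\hat\beta \|^2 \mid \hat\theta] = E\| Y - X\beta_0 \|^2 - 2E[(Y - X\beta_0)' X(\hat\beta - \beta_0)] + \| X(\hat\beta - \beta_0) \|^2$
        $= n\sigma_0^2 - 2\mu' P_X X(\hat\beta - \beta_0) + \| X(\hat\beta - \beta_0) \|^2$
        $= n\sigma_0^2 + \| X(\hat\beta - \beta_0) \|^2$

therein, we obtain (3.11). The last equality holds because $P_X X = 0$.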

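Proof of Theorem 3.1

The risk function is

(A.11)  $R(F(\cdot \mid \hat\theta)) = E[W(F(\cdot \mid \hat\theta))]$
        $= \log(2\pi) + \log(\sigma^2) + E\left[\log\left(\frac{\hat\sigma^2}{\sigma^2}\right)\right] + E\left(\frac{\sigma^2}{\hat\sigma^2}\right) + \frac{1}{n\sigma^2} E\left(\frac{\sigma^2}{\hat\sigma^2}\right) E\| X(\hat\beta - \beta) \|^2$

where use is made of the independence of $\hat\sigma^2$ and $\hat\beta$, and the suffix 0 of
$\sigma_0^2$ and $\beta_0$ is dropped. We have the following power series expansions:

(A.12)  $\log\left(\frac{\hat\sigma^2}{\sigma^2}\right) = \log(1 + \Delta) = \Delta - \frac{1}{2}\Delta^2 + \cdots$

(A.13)  $\frac{\sigma^2}{\hat\sigma^2} = \frac{1}{1 + \Delta} = 1 - \Delta + \Delta^2 - \cdots$

where

(A.14)  $\Delta = \frac{\hat\sigma^2 - \sigma^2}{\sigma^2}$.

Note that under the assumptions stated in the Theorem the expectations
of higher order terms in the expansions are of order $O(n^{-2})$.

(A.15)  $\Delta = \frac{1}{n\sigma^2} [Y' P_X Y - n\omega^2 - \mu' P_X \mu]$
        $= \frac{1}{n\sigma^2} [V' P_X V + 2\mu' P_X V - n\omega^2]$

where $V = Y - \mu$. Under the assumptions in the Theorem

(A.16)  $E(V' P_X V) = \omega^2 \operatorname{tr} P_X = (n - k)\omega^2$

(A.17)  $E(V' P_X V)^2 = \omega^4 [(\operatorname{tr} P_X)^2 + 2 \operatorname{tr} P_X]$
        $= [(n - k)^2 + 2(n - k)] \omega^4$

(A.18)  $E(\mu' P_X V) = E[V' P_X V \, \mu' P_X V] = 0$

(A.19)  $E(\mu' P_X V)^2 = \omega^2 \mu' P_X \mu = n\omega^2 (\sigma^2 - \omega^2)$

Hence, rearranging the terms, we obtain

(A.20)  $E(\Delta) = -\frac{k}{n} \left(\frac{\omega^2}{\sigma^2}\right)$,

(A.21)  $E(\Delta^2) = \frac{4}{n} \left(\frac{\omega^2}{\sigma^2}\right) - \frac{2}{n} \left(\frac{\omega^2}{\sigma^2}\right)^2 + O(n^{-2})$.

Also, we have

(A.22)  $E\| X(\hat\beta - \beta) \|^2 = E\| X(X'X)^{-1} X' V \|^2 = \omega^2 \operatorname{tr} X(X'X)^{-1} X' = k\omega^2$.

Therefore,

(A.23)  $E\left[\log\left(\frac{\hat\sigma^2}{\sigma^2}\right)\right] + E\left(\frac{\sigma^2}{\hat\sigma^2}\right) = 1 + \frac{1}{2} E(\Delta^2) + O(n^{-2})$
        $= 1 + \frac{2}{n} \left(\frac{\omega^2}{\sigma^2}\right) - \frac{1}{n} \left(\frac{\omega^2}{\sigma^2}\right)^2 + O(n^{-2})$

(A.24)  $E\left(\frac{\sigma^2}{\hat\sigma^2}\right) E\| X(\hat\beta - \beta) \|^2 = k\omega^2 + O(n^{-1})$

Substituting (A.23) and (A.24) into (A.11), we finally obtain (3.12).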
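
Proof of Theorem 3.2

From (A.12), (A.20) and (A.21)

(A.25)  $E(\log \hat\sigma^2) = \log \sigma^2 + E(\Delta) - \frac{1}{2} E(\Delta^2)$
        $= \log \sigma^2 - \frac{k}{n} \left(\frac{\omega^2}{\sigma^2}\right) - \frac{2}{n} \left(\frac{\omega^2}{\sigma^2}\right) + \frac{1}{n} \left(\frac{\omega^2}{\sigma^2}\right)^2 + O(n^{-2})$.

Moreover, we have

(A.26)  $E\left(\frac{\omega^2}{\hat\sigma^2}\right) = \frac{\omega^2}{\sigma^2} E\left(\frac{\sigma^2}{\hat\sigma^2}\right) = \frac{\omega^2}{\sigma^2} (1 + O(n^{-1}))$

and

(A.27)  $E\left(\frac{\omega^2}{\hat\sigma^2}\right)^2 = \frac{\omega^4}{\sigma^4} E\left(\frac{\sigma^4}{\hat\sigma^4}\right) = \frac{\omega^4}{\sigma^4} (1 + O(n^{-1}))$.

Noting that

(A.28)  $-2 \log f(y \mid \hat\theta) = n \log(2\pi) + n \log \hat\sigma^2 + n$

and combining the above expectations, we obtain

(A.29)  $nE[\mathrm{BIC}(F(\cdot \mid \hat\theta))] = nR(F(\cdot \mid \hat\theta)) + O(n^{-1})$.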

Proof of Lemma 4.1

If we assume a linear normal regression model $Y \sim N(X\beta, \sigma^2 I)$
with diffuse prior for $\beta$ and $\sigma^2$, the conditional posterior distribution
of $\beta$, given $\sigma^2$, is $N(\hat\beta, \sigma^2 (X'X)^{-1})$ where $\hat\beta = (X'X)^{-1} X'y$ is the
maximum likelihood estimate, and also the marginal posterior distribution
for $\sigma^2$ is the inverse gamma distribution with the density function

(A.30)  $p(\sigma^2 \mid y) = \frac{1}{\Gamma(\nu/2)} \left(\frac{\nu s^2}{2}\right)^{\nu/2} \frac{1}{(\sigma^2)^{\nu/2 + 1}} \exp\left(-\frac{\nu s^2}{2\sigma^2}\right)$

where $\nu = n - k$ and $s^2 = n\hat\sigma^2/(n - k)$. The proof is given by Zellner
[8]. The conditional expectation of $\| X(\tilde\beta - \beta) \|^2$ with respect to the
posterior distribution is

(A.31)  $E_{\beta \mid y, \sigma^2} \| X(\tilde\beta - \beta) \|^2 = E_{\beta \mid y, \sigma^2} \| X(\beta - \hat\beta) \|^2 + \| X(\hat\beta - \tilde\beta) \|^2$
        $\geq k\sigma^2$

where the lower bound is attainable when $\tilde\beta = \hat\beta$; i.e., the Bayes estimate
$\beta^*$ of $\beta$ is nothing but the ML estimate. A straightforward integration
yields

(A.32)  $E_{\sigma^2 \mid y}(\sigma^2) = \frac{\nu}{\nu - 2} s^2 = \frac{n - k}{n - k - 2} s^2$

as long as $\nu > 2$. Hence

(A.33)  $E_{\beta, \sigma^2 \mid y}[W(F(\cdot \mid \tilde\theta))] \geq \log(2\pi) + \log \tilde\sigma^2 + \frac{n - k}{n - k - 2} \left(1 + \frac{k}{n}\right) \frac{s^2}{\tilde\sigma^2}$.

The Bayes estimate of $\sigma^2$ is the $\tilde\sigma^2$ that minimizes the right-hand side of the
above inequality; i.e.

(A.34)  $\sigma^{*2} = \frac{n + k}{n - k - 2} \hat\sigma^2$

where $\hat\sigma^2$ is the ML estimate of $\sigma^2$. Substituting this into the right-hand
side of (A.33), the minimum attainable Bayes risk is evaluated as
follows:

(A.35)  $B(\beta^*, \sigma^{*2} \mid F) = \log 2\pi + \log \sigma^{*2} + 1$
        $= \log 2\pi + \log \hat\sigma^2 + 1 + \log\left(\frac{n + k}{n - k - 2}\right)$
        $= -\frac{2}{n} \log f(y \mid \hat\theta) + \log\left(\frac{n + k}{n - k - 2}\right)$.
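
The minimization step leading to (A.34) can be checked numerically. The following Python sketch is an added illustration with hypothetical function names and input values: it evaluates the right-hand side of (A.33) as a function of the candidate variance estimate and confirms that the value in (A.34) gives the bound in (A.35).

```python
import math


def risk_bound(sigma_tilde_sq: float, sigma_hat_sq: float, n: int, k: int) -> float:
    """Right-hand side of (A.33) as a function of the candidate estimate."""
    s_sq = n * sigma_hat_sq / (n - k)
    coef = (n - k) / (n - k - 2) * (1.0 + k / n) * s_sq
    return math.log(2.0 * math.pi) + math.log(sigma_tilde_sq) + coef / sigma_tilde_sq


def bayes_variance_estimate(sigma_hat_sq: float, n: int, k: int) -> float:
    """Bayes estimate of the variance, equation (A.34)."""
    return (n + k) / (n - k - 2) * sigma_hat_sq


if __name__ == "__main__":
    n, k, sigma_hat_sq = 20, 3, 1.5      # hypothetical values
    star = bayes_variance_estimate(sigma_hat_sq, n, k)
    for factor in (0.9, 1.0, 1.1):
        print(factor, risk_bound(factor * star, sigma_hat_sq, n, k))
```

The printed bound should be smallest at the factor 1.0, since a function of the form log x + c/x is minimized at x = c.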


Proof of Theorem 4.1

Let $B_1$ and $B_2$ be the minimum attainable Bayes risks, respectively, for
$F_1$ and $F_2$ with diffuse prior for parameters. The difference between
$B_1$ and $B_2$ is

(A.36)  $B_1 - B_2 = \log\left(\frac{\hat\sigma_1^2}{\hat\sigma_2^2}\right) + \log\left[\frac{(n + p)(n - p - q - 2)}{(n + p + q)(n - p - 2)}\right]$.

If this is negative, we should choose $F_1$, and vice versa. By the
monotonicity of the logarithm transformation, $B_1 - B_2 < 0$ is
equivalent to

(A.37)  $\frac{\hat\sigma_1^2}{\hat\sigma_2^2} < \frac{(n + p + q)(n - p - 2)}{(n + p)(n - p - q - 2)}$

which is again equivalent to (4.5).
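
The last equivalence can be spelled out as follows. This is a worked step added for clarity, assuming $n - p - q - 2 > 0$ and $q > 0$ so that the direction of the inequality is preserved. Subtracting 1 from both sides of (A.37) and multiplying by $(n - p - q)/q$ gives

```latex
\begin{align*}
\frac{\hat\sigma_1^2}{\hat\sigma_2^2} - 1
  &< \frac{(n+p+q)(n-p-2) - (n+p)(n-p-q-2)}{(n+p)(n-p-q-2)}
   = \frac{2q(n-1)}{(n+p)(n-p-q-2)}, \\
W = \frac{n-p-q}{q}\left(\frac{\hat\sigma_1^2}{\hat\sigma_2^2} - 1\right)
  &< \frac{2(n-1)(n-p-q)}{(n+p)(n-p-q-2)},
\end{align*}
```

where the numerator simplifies because $(n+p+q)(n-p-2) - (n+p)(n-p-q-2) = q(n-p-2) + q(n+p) = 2q(n-1)$.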

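Proof of Lemma 5.1

(A.38)  $R(F_1(\cdot \mid \hat\theta_1)) - R(F_2(\cdot \mid \hat\theta_2)) = \log\left(\frac{\sigma_1^2}{\sigma_2^2}\right) + \frac{p + 2}{n} \left(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}\right) \omega^2$
        $- \frac{q}{n} \frac{\omega^2}{\sigma_2^2} - \frac{1}{n} \left(\frac{1}{\sigma_1^4} - \frac{1}{\sigma_2^4}\right) \omega^4 + O(n^{-2})$

If we assume that

(A.39)  $\delta = \sigma_1^2 - \sigma_2^2 = O(n^{-1})$,

we have an expansion

(A.40)  $\log\left(\frac{\sigma_1^2}{\sigma_2^2}\right) = \log\left(1 + \frac{\delta}{\sigma_2^2}\right) = \frac{\delta}{\sigma_2^2} + O(n^{-2})$.

Also, it follows that the terms on the right-hand side of (A.38) involving
$1/\sigma_1^2 - 1/\sigma_2^2$ and $1/\sigma_1^4 - 1/\sigma_2^4$ are of order $O(n^{-2})$. Hence, if we
neglect the terms of order $O(n^{-2})$, we can assert that
$R(F_1(\cdot \mid \hat\theta_1)) < R(F_2(\cdot \mid \hat\theta_2))$ if and only if

(A.41)  $\frac{n\delta}{\omega^2} < q$

and vice versa.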